This notebook features an introduction to PixieDust, the Python library that makes data visualization easy.
This notebook runs on Python 2.7 and 3.5, with Spark 1.6 and 2.0.
This introduction is pretty straightforward, but it wouldn't hurt to load up the PixieDust documentation so it's handy.
New to notebooks? Don't worry. Here's all you need to know to run this introduction:
In [ ]:
# To confirm you have the latest version of PixieDust on your system, run this cell
!pip install --user --upgrade pixiedust
Now that you have PixieDust installed and up-to-date on your system, you need to import it into this notebook. This is the last dependency before you can play with PixieDust.
In [1]:
import pixiedust
If you get a message telling you that you're not running the latest version of PixieDust, restart the kernel from the Kernel menu and rerun the import pixiedust command. (Any time you restart the kernel, rerun the import pixiedust command.)
When you see a message like Pixiedust version upgraded from 0.60 to 1.0.2, or Pixiedust version 1.0.2, you're all set.
In [3]:
# Build the SQL context required to create a Spark DataFrame
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
# Create the Spark dataframe, passing in some data, and assign it to a variable
df = sqlContext.createDataFrame(
[("Green", 75),
("Blue", 25)],
["Colors","%"])
The data in the variable df is ready to be visualized, without any further code other than a call to display().
In [4]:
# display the dataframe above as a pie chart
display(df)
After running the cell above, you should see a Spark DataFrame displayed as a pie chart, along with some controls to tweak the display. All that came from passing the DataFrame variable to display().
In the next cell, you'll pass more interesting data to display(), which will also offer more advanced controls.
In [4]:
# create another DataFrame, in a new variable
df2 = sqlContext.createDataFrame(
[(2010, 'Camping Equipment', 3),
(2010, 'Golf Equipment', 1),
(2010, 'Mountaineering Equipment', 1),
(2010, 'Outdoor Protection', 2),
(2010, 'Personal Accessories', 2),
(2011, 'Camping Equipment', 4),
(2011, 'Golf Equipment', 5),
(2011, 'Mountaineering Equipment',2),
(2011, 'Outdoor Protection', 4),
(2011, 'Personal Accessories', 2),
(2012, 'Camping Equipment', 5),
(2012, 'Golf Equipment', 5),
(2012, 'Mountaineering Equipment', 3),
(2012, 'Outdoor Protection', 5),
(2012, 'Personal Accessories', 3),
(2013, 'Camping Equipment', 8),
(2013, 'Golf Equipment', 5),
(2013, 'Mountaineering Equipment', 3),
(2013, 'Outdoor Protection', 8),
(2013, 'Personal Accessories', 4)],
["year","category","unique_customers"])
# This time, we've combined the dataframe and display() call in the same cell
# Run this cell
display(df2)
The chart above, like the first one, is rendered by matplotlib. With PixieDust, you have other options:
- To toggle between renderers, use the Renderers control at the top right of the display output.
- Click the Options button to explore other display configurations; for example, clustering and aggregation.
Here's more on customizing display() output.
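PixieDust records the choices you make in these controls as options in the notebook cell's metadata, so the chart re-renders the same way next time. As an illustrative sketch only (the key names follow PixieDust's displayParams convention and are not shown in this notebook), the metadata for the pie chart above might look something like this, using the Colors and % columns from the df DataFrame:

```json
{
  "pixiedust": {
    "displayParams": {
      "handlerId": "pieChart",
      "keyFields": "Colors",
      "valueFields": "%"
    }
  }
}
```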
In [5]:
# load a CSV with pixiedust.sampleData()
df3 = pixiedust.sampleData("https://github.com/ibm-watson-data-lab/open-data/raw/master/cars/cars.csv")
display(df3)
You should see a scatterplot above, again rendered by matplotlib. Find the Renderers menu at the top right. You should see options for Bokeh and Seaborn. If you don't see Seaborn, it's not installed on your system. No problem: just install it by running the next cell.
In [7]:
# To install Seaborn, uncomment the next line, and then run this cell
#!pip install --user seaborn
If you installed Seaborn, you'll also need to restart your notebook kernel and run the import pixiedust cell again. Find Restart in the Kernel menu above.
End of chapter. Return to table of contents
Data files commonly reside in remote sources, such as public or private marketplaces or GitHub repositories. You can load comma-separated value (CSV) data files using PixieDust's sampleData method.
If you haven't already, import PixieDust. Follow the instructions in Get started.
In [6]:
# Enable the Spark job progress monitor
pixiedust.enableJobMonitor()
In [ ]:
homes = pixiedust.sampleData("https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv")
The pixiedust.sampleData method loads the data into an Apache Spark DataFrame, which you can inspect and visualize using display().
In [8]:
display(homes)
With PixieDust display(), you can visually explore the loaded data using built-in charts, such as bar charts, line charts, scatter plots, and maps.
To explore a data set, use the controls in the display() output to choose a chart type and configure its options.
You can analyze the average home price for each city by choosing:
- Keys: CITY
- Values: PRICE
- Aggregation: AVG
Run the next cell to review the results.
In [9]:
display(homes)
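To see what the AVG aggregation computes, here's a plain-Python sketch of grouping prices by city and averaging each group. The rows below are made up for illustration; they are not from the actual homes data set:

```python
from collections import defaultdict

# Made-up rows standing in for the homes DataFrame (CITY and PRICE as above)
rows = [
    {"CITY": "Boston", "PRICE": 1200000.0},
    {"CITY": "Boston", "PRICE": 1400000.0},
    {"CITY": "Cambridge", "PRICE": 1600000.0},
]

# Accumulate a running [sum, count] per city, then divide -- this is AVG
totals = defaultdict(lambda: [0.0, 0])
for row in rows:
    totals[row["CITY"]][0] += row["PRICE"]
    totals[row["CITY"]][1] += 1

avg_price = {city: s / n for city, (s, n) in totals.items()}
print(avg_price)
```

Spark performs the same grouping and averaging for you, distributed across the cluster.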
In [10]:
display(homes)
In [11]:
# With no arguments, sampleData() lists the built-in sample data sets
pixiedust.sampleData()
The home sales data set you loaded earlier is one of the samples, so you could have loaded it by passing the displayed data set ID as a parameter: homes = pixiedust.sampleData(6)
If your data isn't stored in CSV files, you can load it into a DataFrame from any supported Spark data source. See these Python code snippets for more information.
End of chapter. Return to table of contents
Python has a rich ecosystem of modules, including plotting with matplotlib, data structures and analysis with pandas, machine learning, and natural language processing. However, data scientists working with Spark might occasionally need to call out to code written in Scala or Java; for example, one of the hundreds of libraries available on spark-packages.org. Unfortunately, Jupyter Python notebooks do not currently provide a way to call Scala or Java code. As a result, a typical workaround is to first use a Scala notebook to run the Scala code, persist the output somewhere such as a Hadoop Distributed File System, create another Python notebook, and reload the data. This is obviously inefficient and awkward.
As you'll see in this notebook, PixieDust solves this problem by letting you write and run Scala code directly in its own cell. It also lets variables be shared between Python and Scala, in both directions.
In [12]:
pythonString = "Hello From Python"
pythonInt = 20
If you haven't already, import PixieDust. Follow the instructions in Get started.
PixieDust makes all variables defined in the Python scope available to Scala.
The PixieDust Scala Bridge requires the environment variable SCALA_HOME to be defined and pointing at a Scala installation.
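You can check for (and, if needed, set) SCALA_HOME from Python before importing pixiedust. A minimal sketch; the path below is a hypothetical install location, so adjust it to your system:

```python
import os

# Hypothetical Scala install location -- replace with your system's actual path
if "SCALA_HOME" not in os.environ:
    os.environ["SCALA_HOME"] = "/usr/local/share/scala"

# Confirm the Scala Bridge will be able to find it
print(os.environ["SCALA_HOME"])
```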
In [13]:
%%scala
print(pythonString)
print(pythonInt + 10)
In [14]:
%%scala
//Reuse the sqlContext object available in the python scope
val c = sqlContext.asInstanceOf[org.apache.spark.sql.SQLContext]
import c.implicits._
val __dfFromScala = Seq(
(2010, "Camping Equipment", 3, 200),
(2010, "Golf Equipment", 1, 240),
(2010, "Mountaineering Equipment", 1, 348),
(2010, "Outdoor Protection", 2, 200),
(2010, "Personal Accessories", 2, 200),
(2011, "Camping Equipment", 4, 489),
(2011, "Golf Equipment", 5, 234),
(2011, "Mountaineering Equipment",2, 123),
(2011, "Outdoor Protection", 4, 654),
(2011, "Personal Accessories", 2, 234),
(2012, "Camping Equipment", 5, 876),
(2012, "Golf Equipment", 5, 200),
(2012, "Mountaineering Equipment", 3, 156),
(2012, "Outdoor Protection", 5, 200),
(2012, "Personal Accessories", 3, 345),
(2013, "Camping Equipment", 8, 987),
(2013, "Golf Equipment", 5, 434),
(2013, "Mountaineering Equipment", 3, 278),
(2013, "Outdoor Protection", 8, 134),
(2013, "Personal Accessories", 4, 200)).toDF("year", "zone", "unique_customers", "revenue")
print(__dfFromScala)
In [15]:
display(__dfFromScala)
In this chapter, you've seen how easy it is to intersperse Scala and Python in the same notebook. Continue exploring this powerful functionality by using more complex Scala libraries!
End of chapter. Return to table of contents
PixieDust PackageManager helps you install Spark packages inside your notebook. This is especially useful when you're working in a hosted cloud environment without access to configuration files. Use PixieDust PackageManager to install packages from spark-packages.org, from the Maven repository, or directly from a URL.
Note: After you install a package, you must restart the kernel and run import pixiedust again.
In [2]:
import pixiedust
pixiedust.printAllPackages()
In [3]:
# For Spark 2.0, uncomment and run the next line
#pixiedust.installPackage("graphframes:graphframes:0")
# For Spark 1.6, uncomment and run the next line
#pixiedust.installPackage("graphframes:graphframes:0.1.0-spark1.6")
In [7]:
pixiedust.printAllPackages()
GraphFrames comes with sample data sets. Even if GraphFrames is already installed, running the install command loads the Python code that comes with the package and enables features like the one you're about to see. Run the following cell, and PixieDust displays a sample graph data set called friends. At the upper left of the display, click the table dropdown to switch between views of nodes and edges.
In [8]:
# import the Graphs example
from graphframes.examples import Graphs
# create the friends example graph
g = Graphs(sqlContext).friends()
# use the PixieDust display
display(g)
To install a package from the Apache Maven search repository, visit the project and find the groupId and artifactId for the package that you want, then enter them in the installation command. See the instructions for the installPackage command. For example, the following cell installs Apache Commons:
In [7]:
pixiedust.installPackage("org.apache.commons:commons-csv:0")
In [9]:
pixiedust.installPackage("https://github.com/ibm-watson-data-lab/spark.samples/raw/master/dist/streaming-twitter-assembly-1.6.jar")
To understand what you can do with this jar file, read David Taieb's tutorial, Realtime Sentiment Analysis of Twitter Hashtags with Spark.
In [3]:
pixiedust.uninstallPackage("org.apache.commons:commons-csv:0")
End of chapter. Return to table of contents
You can save the data directly to a Cloudant or CouchDB database.
Prerequisite: Collect your database connection information: the database host, user name, and password.
If your Cloudant instance was provisioned in Bluemix, you can find the connectivity information in the Service Credentials tab.
To stash to Cloudant:
- In the display output, click the Download button.
- Click the + (plus) button to add a new connection.
For a local CouchDB instance, the connection entry looks like this:
{
"name": "local-couchdb-connection",
"credentials": {
"username": "couchdbuser",
"password": "password",
"protocol": "http",
"host": "127.0.0.1:5984",
"port": 5984,
"url": "http://couchdbuser:password@127.0.0.1:5984"
}
}
For a remote Cloudant instance, the connection entry looks like this:
{
"name": "remote-cloudant-connection",
"credentials": {
"username": "username-bluemix",
"password": "password",
"host": "host-bluemix.cloudant.com",
"port": 443,
"url": "https://username-bluemix:password@host-bluemix.cloudant.com"
}
}
Alternatively, you can save the data set to various file formats (for example, CSV, JSON, or XML).
To save a data set as a file, click the Download button in the display output and choose a format.
End of chapter. Return to table of contents
By now, you've walked through PixieDust's intro notebooks and seen PixieDust in action. If you like what you saw, join the project!
Anyone can get involved. Here are some ways you can contribute:
Contribute your own custom visualization. Here's a taste of how it works.
Run the next 4 cells for a small example you can try right within this notebook. Then read more about how to create a custom visualization.
In [ ]:
import pixiedust
Now, create a simple DataFrame:
In [ ]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
d1 = sqlContext.createDataFrame(
[(2010, 'Camping Equipment', 3),
(2010, 'Golf Equipment', 1),
(2010, 'Mountaineering Equipment', 1),
(2010, 'Outdoor Protection', 2),
(2010, 'Personal Accessories', 2),
(2011, 'Camping Equipment', 4),
(2011, 'Golf Equipment', 5),
(2011, 'Mountaineering Equipment',2),
(2011, 'Outdoor Protection', 4),
(2011, 'Personal Accessories', 2),
(2012, 'Camping Equipment', 5),
(2012, 'Golf Equipment', 5),
(2012, 'Mountaineering Equipment', 3),
(2012, 'Outdoor Protection', 5),
(2012, 'Personal Accessories', 3),
(2013, 'Camping Equipment', 8),
(2013, 'Golf Equipment', 5),
(2013, 'Mountaineering Equipment', 3),
(2013, 'Outdoor Protection', 8),
(2013, 'Personal Accessories', 4)],
["year","zone","unique_customers"])
The following cell creates a new custom table visualization plugin called NewSample:
In [ ]:
from pixiedust.display.display import *

class TestDisplay(Display):
    def doRender(self, handlerId):
        self._addHTMLTemplateString(
"""
NewSample Plugin
<table class="table table-striped">
    <thead>
        {%for field in entity.schema.fields%}
        <th>{{field.name}}</th>
        {%endfor%}
    </thead>
    <tbody>
        {%for row in entity.take(100)%}
        <tr>
            {%for field in entity.schema.fields%}
            <td>{{row[field.name]}}</td>
            {%endfor%}
        </tr>
        {%endfor%}
    </tbody>
</table>
"""
        )

@PixiedustDisplay()
class TestPluginMeta(DisplayHandlerMeta):
    @addId
    def getMenuInfo(self, entity, dataHandler):
        if entity.__class__.__name__ == "DataFrame":
            return [
                {"categoryId": "Table", "title": "NewSample Table", "icon": "fa-table", "id": "newsampleTest"}
            ]
        else:
            return []
    def newDisplayHandler(self, options, entity):
        return TestDisplay(options, entity)
Next, run display() to show the data. Click the Table dropdown: you now see the NewSample Table option, the custom visualization you just created!
In [ ]:
display(d1)
Error? If you changed the name yourself in the plugin cell above, you might get an error when you try to display. You can fix this by updating the metadata of the display() cell. To do so, go to the Jupyter menu above the notebook and choose View > Cell Toolbar > Edit Metadata. Then scroll down to the display(d1) cell, click its Edit Metadata button, and change the handlerId.
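As a sketch of what the edited metadata might look like, assuming PixieDust's displayParams convention (newsampleTest is the id registered by the plugin code above; if you renamed it, use your id instead):

```json
{
  "pixiedust": {
    "displayParams": {
      "handlerId": "newsampleTest"
    }
  }
}
```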
PixieDust lets you switch between renderers for charts and maps. We'd love to add more to the list, and it's easy to get started: try the generate tool to create a boilerplate renderer using a quick CLI wizard. Read how to build a renderer.
Found a bug? Thought of a great enhancement? Enter an issue to let us know. Tell us what you think.
Ready to pitch in? We can't wait to see what you share. More on how to contribute.
End of chapter. Return to table of contents